Linking EORTC QLQ-C30 and PedsQL/PEDQOL physical functioning scores in patients with osteosarcoma

Purpose The available questionnaires for quality-of-life (QoL) assessments are age-group specific, limiting comparability and impeding longitudinal analyses. The comparability of measurements, however, is a necessary condition for gaining scientific evidence. To overcome this problem, we assessed the viability of harmonising data from paediatric and adult patient-reported outcome (PRO) measures. Method To this end, we linked physical functioning scores from the Paediatric Quality of Life Inventory (PedsQL) and the Paediatric Quality of Life Questionnaire (PEDQOL) to the European Organisation for Research and Treatment of Cancer Core Questionnaire (EORTC QLQ-C30) for adults. In the EURAMOS-1 QoL sub-study, samples of 75 (PedsQL) and 112 (PEDQOL) adolescent osteosarcoma patients completed a paediatric and the adult questionnaire concurrently on 98 (PedsQL) and 156 (PEDQOL) occasions. We identified corresponding scores using the single-group equipercentile linking method. Results Linked physical functioning scores showed sufficient concordance with the EORTC QLQ-C30: Lin's ρ = 0.74 (PedsQL) and Lin's ρ = 0.64 (PEDQOL). Conclusion Score linking provides clinicians and researchers with a common metric for assessing QoL with PRO measures across the entire lifespan of patients.


Introduction
Quality-of-life (QoL) data are generally collected by self-report questionnaires. Health-related QoL questionnaires can be age-group specific. This age group specificity limits comparability and impedes numerical longitudinal analysis, especially if different instruments are needed to span the age range of the study. Specifically, the motivation for linking scores from paediatric and adult instruments was to make them comparable on a common scale, allowing the study of the QoL developmental trajectory continuously and permitting the analysis with mixed models.
The use of different instruments constitutes a considerable hurdle for the analysis and interpretation of QoL data, since "[t]he comparability of measurements made in differing circumstances by different methods and investigators is a fundamental precondition for all of science" [1]. Therefore, valid methods for linking scores are required.
Dorans provides an overview of applying linking methodology within the realm of patient-reported outcome (PRO) measures [2] (Table 1a).
In the present study, we evaluated the viability of linking physical functioning scores of two paediatric PRO questionnaires (the PedsQL and the PEDQOL) to the EORTC QLQ-C30 in a population of survivors of childhood osteosarcoma. We restrict our report to the physical functioning domain because we were mainly interested in the viability of linking paediatric and adult instruments. We provide information on linking the emotional functioning, cognitive functioning, social functioning, fatigue and pain domains in the appendix.

Materials and methods
The overall study design [25,26] and the methodological specifics of the QoL questionnaire sub-study have been laid out in detail previously [27]. We briefly describe the study design.

Participants
The EURAMOS-1 trial cohort consisted of 2260 participants who, between the ages of 5 and 40 years, had been diagnosed with a previously untreated resectable high-grade osteosarcoma (at any site, except for craniofacial structures). Among these, 2213 participants were eligible for QoL assessment (≥5 years old) and had a questionnaire available in their respective language (see [27]). Recruitment took place between 2005 and 2011, involving 17 countries and four study groups: the Children's Oncology Group (COG), the Cooperative Osteosarcoma Group (COSS), the European Osteosarcoma Intergroup (EOI), and the Scandinavian Sarcoma Group (SSG). EURAMOS-1 consortium members and their affiliations are listed in Appendix E.1. We obtained demographics from the EURAMOS-1 enrolment survey (sex, date of birth, and study group). Age was stratified as "5 to 15", "16 to 17" and "18 or older". As a secondary outcome measure, QoL was assessed prospectively at four time points during and after treatment (Fig. 1).

Questionnaires
Due to the unavailability of a single questionnaire suited for use across the whole age span of participants and in all participating countries, the EURAMOS-1 consortium opted for using different, age- and country-specific instruments (Table 2a).
In the age range 16-18 years, all patients were asked to complete a paediatric questionnaire (either PedsQL [28] or PEDQOL [29]) and the EORTC QLQ-C30 [30]. We used this sub-sample for score linking. We restricted our study to aggregate scores pertaining to physical functioning, given its significance to QoL in osteosarcoma survivors and the substantial conceptual overlap between instruments in this domain. We linked two sub-sets of participants aged 16-17 years. These sub-sets were administered either the PedsQL or the PEDQOL questionnaire before the EORTC QLQ-C30 on the same day.

Analyses
Similarity of item content and physical functioning sub-scale structure between instruments
The PedsQL, the PEDQOL and the EORTC QLQ-C30 all assess the physical functioning domain with multiple items (for details on scoring, see Table 2b and for verbatim item content see Appendix F).
Item content showed substantial overlap across the three measures. To measure internal consistency of the instruments, we calculated Cronbach's α. A summary of the results is given in Table 2c.

Summary of physical functioning raw scores, correlation and concordance between instruments
The overall mean physical functioning score, i.e. across all four time points, was 51.6 (SD = 22.7) for the PedsQL and 74.3 (SD = 22.3) for the corresponding EORTC QLQ-C30 (n = 98). The overall mean for physical functioning of the PEDQOL was 46.8 (SD = 25.1) and the corresponding EORTC QLQ-C30 overall mean was 63.5 (SD = 27.2) (n = 156).
The correlations between the EORTC QLQ-C30 physical functioning sub-scale and the corresponding aggregate scores of the two paediatric instruments were both good, but the PedsQL physical functioning raw scores correlated more strongly (r = 0.73; 95% confidence interval (CI): 0.63, 0.81) than those of the PEDQOL (r = 0.64; CI: 0.54, 0.73). The physical functioning raw scores of the paediatric questionnaires showed only moderate agreement with those of the EORTC QLQ-C30 before linking, with similar values for the PedsQL (Lin's ρ = 0.49; CI: 0.63, 0.81) and the PEDQOL (Lin's ρ = 0.53; CI: 0.43, 0.63). Given a substantial overlap in item content, we linked the respective aggregate physical functioning scores.

Table 1 Publications on linking PRO measures.

Linking design
To produce physical functioning crosswalks (score conversion tables), we linked scores of those participants who had completed one of the two paediatric instruments and the EORTC QLQ-C30 at the same time point. This group consisted of participants who were 16-18 years old. This linking technique, referred to as the single-group design, is akin to a repeated measures design with a single group and two treatments [31]. It is considered the most valid linking design because the scores of identical individuals are linked, thus requiring the smallest sample size to achieve the same level of accuracy as designs with a lesser degree of group equivalency [32].
To ensure that the instruments to be linked showed sufficient conceptual congruity [2], we employed two methods, modelling our approach on Choi et al. (2014) and Marrie et al. (2020). First, we reviewed the content of the physical functioning items of the three instruments to ensure that they indeed measure approximately the same concept. Second, to assess internal consistency, we calculated Cronbach's α for the three questionnaires.
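The internal consistency check can be illustrated with a minimal sketch of Cronbach's α for one sub-scale. The function name `cronbach_alpha` and the response matrix are hypothetical and serve illustration only; they are not study data, and the sketch is not the software used in the analysis.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for one sub-scale.

    items: 2-D array-like, rows = respondents, columns = items.
    """
    x = np.asarray(items, dtype=float)
    k = x.shape[1]                                 # number of items
    item_var_sum = x.var(axis=0, ddof=1).sum()     # sum of the item variances
    total_var = x.sum(axis=1).var(ddof=1)          # variance of the sum score
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical responses of five participants to four items (not study data).
demo = [
    [0, 1, 1, 2],
    [1, 1, 2, 2],
    [2, 3, 3, 4],
    [3, 3, 4, 4],
    [1, 2, 2, 3],
]
print(round(cronbach_alpha(demo), 3))
```

Values close to 1 indicate that the items of a sub-scale vary together, which is what the aggregate score presupposes.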

Linking function
We performed identity, mean, linear, equipercentile and circle-arc linking procedures (Fig. 2). Before linking, we applied log-linear pre-smoothing to three moments to adjust for potential sampling error introduced by uneven score distributions [33]. Log-linear pre-smoothing is a recommended procedure for small samples such as ours because a smoothed distribution yields more reliable results [33]. To determine the best linking method, we compared the root mean square error (RMSE) of the candidate methods, estimated by parametric bootstrapping (for details see [34], 5.7).
We chose the equipercentile linking method to produce crosswalk tables, as it emerged as the method with the most favourable linking quality parameters, overall.
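The principle of equipercentile linking in a single-group design can be sketched as follows: a paediatric score is mapped to the adult score that has the same percentile rank in the same sample. The sketch is deliberately simplified. It uses mid-percentile ranks and linear interpolation, omits the log-linear pre-smoothing, and does not reproduce the implementation of the R package equate. The function names and the example scores are hypothetical.

```python
import numpy as np

def percentile_rank(scores, value):
    """Mid-percentile rank: share of scores below `value` plus half the share equal to it."""
    s = np.asarray(scores, dtype=float)
    return 100.0 * (np.mean(s < value) + 0.5 * np.mean(s == value))

def equipercentile_crosswalk(x_scores, y_scores, x_grid):
    """Map each score in x_grid to the Y score with the same percentile rank.

    x_scores, y_scores: scores of the same participants on both instruments
    (single-group design). Unsmoothed, illustrative variant.
    """
    y = np.asarray(y_scores, dtype=float)
    y_vals = np.unique(y)                                   # sorted observed Y scores
    y_ranks = np.array([percentile_rank(y, v) for v in y_vals])
    x_ranks = np.array([percentile_rank(x_scores, v) for v in x_grid])
    linked = np.interp(x_ranks, y_ranks, y_vals)            # clamps outside the observed range
    return {x: float(v) for x, v in zip(x_grid, linked)}

# Hypothetical paired scores (not study data): Y is X shifted by 10 points.
x = [10, 20, 20, 30, 50]
y = [20, 30, 30, 40, 60]
crosswalk = equipercentile_crosswalk(x, y, x_grid=[10, 20, 30, 50])
print(crosswalk)
```

Because the two hypothetical distributions have the same shape, every score is mapped to itself plus 10 points. With real data the mapping is generally non-linear, which is the reason for preferring this method over mean or linear linking when the score distributions differ in shape.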

Evaluation of linking quality
As a first step towards ascertaining the agreement between paediatric and adult QoL instruments, we created Bland-Altman plots [35] (Fig. 3 and Table 2d). We plotted the differences (y-axis) for scores linked from each paediatric questionnaire and those measured by the EORTC QLQ-C30 against subject means (x-axis) to check for patterns and distributions. Following Zhou et al. [36], we established that the limits of agreement for linked and measured scores were to be considered "good" if they fell within one standard deviation (SD) of the mean of measured EORTC QLQ-C30 scores, "fair" if they did not extend beyond two SDs, and "poor" otherwise. Additionally, we calculated Pearson's correlation coefficient r and Lin's concordance correlation coefficient between each of the two paediatric measures and the EORTC QLQ-C30.
Table 2c Internal consistency reliability of the physical functioning aggregate scores of the three instruments.
We prepared histograms of the differences between measured and linked EORTC QLQ-C30 scores to visually inspect whether the distributions approximate normality (Fig. 4).
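The limits of agreement and their rating can be sketched as below. The sketch assumes the conventional 95% limits (bias ± 1.96 SD of the differences) and reads the rating rule as comparing the wider limit, in absolute terms, with one and two SDs of the measured EORTC QLQ-C30 scores. Both the reading of the rule and the function names are assumptions made for illustration, and the scores are hypothetical.

```python
import numpy as np

def limits_of_agreement(linked, measured):
    """Return bias and 95% limits of agreement of linked minus measured scores."""
    d = np.asarray(linked, dtype=float) - np.asarray(measured, dtype=float)
    bias = d.mean()
    half_width = 1.96 * d.std(ddof=1)
    return bias, bias - half_width, bias + half_width

def rate_agreement(lower, upper, sd_measured):
    """Rate the limits against the SD of the measured scores (assumed reading of the rule)."""
    widest = max(abs(lower), abs(upper))
    if widest <= sd_measured:
        return "good"
    if widest <= 2 * sd_measured:
        return "fair"
    return "poor"

# Hypothetical linked and measured scores (not study data).
linked = [50, 60, 70, 80]
measured = [52, 58, 71, 79]
bias, lower, upper = limits_of_agreement(linked, measured)
rating = rate_agreement(lower, upper, np.std(measured, ddof=1))
print(round(bias, 2), round(lower, 2), round(upper, 2), rating)
```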
Details on software are given in Appendix A.

Participant characteristics
The QoL sub-sample consisted of 2213 osteosarcoma patients. Table 3a gives an overview of patient characteristics (sex, age, and study group) for the physical functioning domain, overall and by linked questionnaire (PedsQL or PEDQOL).

Bland-Altman plots
We used Bland-Altman plots to compare PedsQL and PEDQOL scores to EORTC QLQ-C30 scores. The interpretation of Bland-Altman plots is premised on normality and homoscedasticity of the distribution. We prepared histograms for the distributions of differences (Fig. 4 and Table 2d) to make a first visual assessment. We then prepared Bland-Altman plots (Fig. 3) displaying the differences in scores between each paediatric instrument and the EORTC QLQ-C30 against the respective means.
To inspect for heteroscedasticity, we prepared quantile-quantile (Q-Q) plots (Fig. 5) for differences between scores linked from the two paediatric questionnaires and EORTC QLQ-C30 scores. We judged that scores linked from the PEDQOL displayed adequate homoscedasticity. However, scores linked from the PedsQL indicated an uneven, left-skewed distribution. Therefore, we log-transformed the score differences, achieving better overall homoscedasticity, albeit with a remaining left skew (Fig. 6). To account for the presence of substantial heteroscedasticity in scores linked from the PedsQL, we prepared a Bland-Altman plot on log-transformed data (Fig. 7a), which indicated a better fit of limits of agreement. Given that log-transformed scores do not lend themselves to easy interpretation for clinical practice, we additionally plotted the score differences in a conventional Bland-Altman plot on the original scale with back-transformed limits of agreement (Fig. 7b) [37,38].
In summary, we judged agreement for physical functioning scores acceptable, as the limits of agreement did not extend beyond two standard deviations of EORTC QLQ-C30 scores for either of the paediatric instruments, and the majority of scores were within one standard deviation of EORTC QLQ-C30 scores.

Correlations between physical functioning aggregate scores of paediatric and adult instruments
Additionally, we calculated Pearson's r and Lin's concordance correlation coefficient ρ [39] between the EORTC QLQ-C30 and the PedsQL and PEDQOL physical functioning converted scores. The correlation coefficients for physical functioning scores were good for both the PedsQL and the PEDQOL to EORTC QLQ-C30 conversions, with a Lin's ρ of 0.74 and 0.64, respectively (Tables 3b and 3c).
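The difference between the two coefficients can be made concrete with a minimal sketch. Lin's coefficient is computed here from its standard definition, ρc = 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), using population (1/n) moments. The function name `lins_ccc` and the example values are hypothetical and not study data.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient with population (1/n) moments."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_xy = np.mean((x - x.mean()) * (y - y.mean()))   # covariance
    return 2 * s_xy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Hypothetical scores: the second measure is the first one shifted by 2 points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x + 2.0

pearson_r = np.corrcoef(x, y)[0, 1]
ccc = lins_ccc(x, y)
print(round(pearson_r, 3), round(ccc, 3))
```

In this example the two measures are perfectly linearly related, so Pearson's r equals 1, yet no participant receives the same score on both. Lin's coefficient penalises the systematic shift and drops to 0.5.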

Correlations between other aggregate scores of paediatric and adult instruments
The converted fatigue scores of the PedsQL and the PEDQOL both correlated well with EORTC QLQ-C30 scores (Lin's ρ = 0.69 and Lin's ρ = 0.71). Correlation coefficients for pain were moderate for the PedsQL (Lin's ρ = 0.58) and good for the PEDQOL (Lin's ρ = 0.73). Correlation coefficients for emotional functioning were moderate for the PedsQL (Lin's ρ = 0.55) and fair for the PEDQOL (Lin's ρ = 0.36) conversions to EORTC QLQ-C30 scores. The correlation of converted cognitive functioning scores with EORTC QLQ-C30 scores was fair for the PedsQL (Lin's ρ = 0.37) and moderate for the PEDQOL (Lin's ρ = 0.47). Converted social functioning scores correlated poorly with EORTC QLQ-C30 scores for both the PedsQL (Lin's ρ = 0.17) and the PEDQOL (Lin's ρ = 0.08).

Discussion
Data harmonisation provides a number of benefits by permitting the pooling of data, such as answering novel research questions or increasing statistical power. Despite a growing interest in harmonising data, retrospective data harmonisation (after data collection) is the rule and prospective harmonisation (before data collection) the exception [3]. While it may be due to a lack of foresight or practicability that retrospective data harmonisation remains the only option, harmonising data prospectively may also be inherently impossible. This was the case in the international research collaboration from which the present study grew, which included longitudinal QoL assessments in adult survivors of childhood osteosarcoma. The use of different PRO measures during childhood and adulthood was unavoidable, as no suitable instrument for both age groups existed.
Table footnotes: 1 The columns pertain to those participants whose PedsQL or PEDQOL scores were linked to their respective EORTC QLQ-C30 scores. Therefore, the table does not contain a separate column for EORTC QLQ-C30 scores. 2 Age refers to the age at the time of registration for participation in the study.
To obtain harmonised data retrospectively, we linked the scores from two paediatric PRO measures to an adult PRO measure to assess the quality of life across the lifespan of osteosarcoma survivors. Visual and numerical concordance assessments indicated good agreement between physical functioning aggregate scores. The equipercentile linking method yielded the best overall results for this sample. Sub-sets consisting of 75 (PedsQL) and 112 participants (PEDQOL) yielded 98 (PedsQL) and 156 (PEDQOL) score pairings between paediatric and adult questionnaires and were sufficient to permit score linking for the whole cohort and enabled the analysis of QoL data for a forthcoming publication.
In domains other than physical functioning, the concordance estimates obtained with Pearson's r diverged from those obtained with dedicated concordance coefficients (Appendix, Table D.1), thus confirming that Pearson's r is not a useful measure for assessing intra-individual agreement. The Pearson correlation coefficient (Pearson's r) is generally not considered a suitable measure of concordance because it is only informative if the relationship between two variables is linear, thus potentially leading to incorrect conclusions in case of non-linearity. Crucially, Pearson's r only evaluates the extent of a linear relationship on a population level, ignoring intra-individual concordance. Despite its apparent shortcomings, Pearson's r continues to be widely employed in the score linking literature as a measure of agreement between two instruments. This is all the more surprising, given that non-linear score linking methods were presumably developed specifically to account for non-linear agreement between two instruments. Due to its continued popularity and to underscore differences between concordance measures, we nevertheless included Pearson's r alongside Lin's concordance correlation coefficient ρ [39], which we consider more apt. We provide an evaluation according to value ranges to allow a verbal interpretation, similar to the kappa concordance coefficient for binary variables [35], with five categories, ranging from "Poor" to "Very Good" (Table 3c).
Building on McNemar's coefficient of alienation, Dorans [40] defined "Reduction in Uncertainty". Since a 50% reduction in uncertainty, as measured in score units, requires a Pearson's r of at least 0.866, Dorans recommended a correlation of this magnitude as an appropriate lower bound. This recommendation was made in the context of high-stakes educational testing, as Choi and colleagues [12] have pointed out. For linking health outcome measures, they suggested a correlation of 0.75-0.80 as an appropriate minimum, given that aggregate outcomes are the focus of interest, and in particular when using a single-group design which permits the direct evaluation of accuracy.
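The threshold of 0.866 follows directly from the definition. Writing the coefficient of alienation as the square root of 1 − r² and the reduction in uncertainty (abbreviated RiU here, a notation introduced for this illustration) as its complement, the bound and the values implied by the lower correlations of 0.75 and 0.80 can be worked out for non-negative r:

```latex
\begin{align*}
\mathrm{RiU}(r) &= 1 - \sqrt{1 - r^{2}} \\
\mathrm{RiU}(r) \ge 0.5
  &\iff \sqrt{1 - r^{2}} \le 0.5
  \iff r^{2} \ge 0.75
  \iff r \ge \sqrt{0.75} \approx 0.866 \\
\mathrm{RiU}(0.75) &= 1 - \sqrt{0.4375} \approx 0.34 \\
\mathrm{RiU}(0.80) &= 1 - \sqrt{0.36} = 0.40
\end{align*}
```

A correlation of 0.75 to 0.80 thus corresponds to a reduction in uncertainty of roughly 34% to 40%, compared with the 50% required for high-stakes testing.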
A limitation of our study is that our results may not be population invariant, i.e. the linking quality parameters we obtained may not generalise to other populations. Previous studies linking PedsQL or PEDQOL physical functioning aggregate scores to the EORTC QLQ-C30 are lacking. Therefore, we were unable to draw comparisons to similar or dissimilar populations and we cannot generalise our findings beyond the highly selective clinical population our sample was drawn from. The aim of our study was to evaluate the feasibility of linking paediatric and adult PRO measures within a population of osteosarcoma survivors. Clearly, our findings are restricted to this narrowly circumscribed area of clinical practice and research. The methodology also does not allow for harmonisation in completely disparate age groups (e.g. 5-10-year-olds with 35-40-year-olds).
The use of age-adequate (i.e. age-specific) questionnaires for children seems unavoidable, rendering a direct comparison of paediatric and adult scores in survivors of childhood cancer inherently impossible. Therefore, we see the potential general utility of score linking in this field in offering interoperability of paediatric and adult PRO measures, and the specific value of this study in showing the viability of this approach for the first time. Having established its feasibility, the approach described may be integrated in future study designs involving dissimilar populations. Doing so may yield evidence regarding the population invariance of our results.
Another limitation of our study is that we cannot rule out an order effect, i.e. the relationship of the instruments may have depended on the order of their administration. This point should be addressed in future investigations by randomising the order of administration. In a similar vein, the administration of two questionnaires at the same point may have biased the responses to the second questionnaire. Randomising the order of administration should also reduce fatigue bias, by equalising the directionality of such an effect between the instruments.
We consider the single-group design a major strength of our study, as it provides the firmest methodological grounds for score linking. Its inherent potential disadvantages should be balanced against its strengths and against the weaknesses of alternative linking designs. Using a single-group design, we obtained actual and linking-derived scores from the same population. This allowed us to evaluate the accuracy of our linking functions directly. As a tangible product, we created crosswalk tables linking PedsQL and PEDQOL physical functioning aggregate scores to the EORTC QLQ-C30 (see Appendix, Table L.1), which will advance data harmonisation and will enable us to perform longitudinal analyses within the EURAMOS-1 cohort.
Table 3c Interpretation of concordance correlation coefficients [35].

Conclusion
With score linking, it is possible to directly compare scores of osteosarcoma patients obtained with distinct age-group-specific inventories and observe their QoL across the entire lifespan. The approach may create the conditions for conducting longitudinal mixed-model meta-analyses. We consider score linking a promising tool for assuring comparability of intra-individual QoL assessments in studies over time and extending across different stages of life. We anticipate that oncological QoL research may strongly benefit from score linking.

Funding
The study sponsors were the UK Medical Research Council in Europe and the US National Cancer Institute in North America and Australia. Each trial group organised local coordination elements; central coordination and analysis were led from the Medical Research Council Clinical Trials Unit at University College London. Neither the sponsors nor the funders of the trial had a role in trial design, data analysis, or data interpretation. EURAMOS-1 is an academic clinical trial funded through multiple national and international government agencies and cancer charities: -Children's Oncology Group funding for the EURAMOS-1 trial (AOST0331) was supported by the National Clinical

Conflict of interest statement
The authors declare the following financial interests/ personal relationships which may be considered as potential competing interests: SB reports grants from Deutsche Krebshilfe, Deutsche Forschungsgemeinschaft, and European Science Foundation during the conduct of the study and personal fees from Lilly, Bayer, Pfizer, Novartis, Isofol, Clinigen, Sensorion, Ipsen, and Roche outside the submitted work. MRS reports grants and nonfinancial support from Astellas, grants from Clovis, grants and nonfinancial support from Janssen, grants and nonfinancial support from Novartis, grants and nonfinancial support from Pfizer, and grants and nonfinancial support from Sanofi, during the conduct of the study and personal fees from Lilly Oncology and personal fees from Janssen for educational courses and workshops outside the submitted work. NM reports employment by Five Prime Therapeutics, Inc and Sanofi US, outside the submitted work. The remaining authors declare no conflicts of interest.

Acknowledgements
The authors thank all the patients and parents for their contribution and all data managers, especially Ms Eva-Mari Olofsson from the SSG office in Lund, for their support in collecting the data from all patients in a timely manner.

A. Software
We conducted all statistical analyses using version 4.1.0 of the R platform, version 2.0.7 of R package equate for score linking, R package blandr for the calculation of concordance correlation coefficients and the tidyverse suite of R packages for data preparation and data visualisation.
B. Characteristics of participants by domain, overall and linked.
C. Internal consistency reliability of the three instruments by domain.
E. EURAMOS-1 consortium.
G. Emotional functioning items per questionnaire.

G.1 PedsQL emotional functioning items.
In the past ONE month, how much of a problem has this been for you .
I. Social functioning items per questionnaire.

I.1 PedsQL social functioning items.
J. Fatigue items per questionnaire.
K. Pain items per questionnaire.

K.1 PedsQL pain items.
L. Crosswalks between the PedsQL/the PEDQOL and the EORTC QLQ-C30.
Score differences are defined as the paediatric instrument score less the adult instrument score. 3 The US-English versions are displayed here as examples.